Quality Control

Author

Will Gammerdinger, Noor Sohail, Zhu Zhuo, James Billingsley, Shannan Ho Sui

Published

July 22, 2025

Approximate time: XY minutes

Learning Objectives

  • Construct quality control metrics and visually evaluate the quality of the data
  • Apply appropriate filters to remove low quality cells
  • Create a filtered seurat object

Quality Metrics

In Visium HD data, the main challenge is in delineating bins that are poor quality from bins containing reads from less complex cells. If you expect a particular cell type in your dataset to be less transcriptionally active as compared other cell types in your dataset, the bins underneath this cell type will naturally have fewer detected genes and transcripts. However, having fewer detected genes and transcripts can also be a technical artifact and not a result of biological signal.

We will assess a variety of metrics to evaluate which bins are considered low/high quality. We will apply very permissive filtering here as it has been shown that low expression can be biologically meaningful for spatial context so we won’t be as stringent as we normally are with scRNA-seq.

The metrics we will be using to filter low-quality bins from high-quality ones include:

  • Cell counts
  • UMI counts per cell
  • Genes detected per cell
  • Complexity (novelty score)
  • Mitochondrial counts ratio

Some of these values we will calculate throughout this lesson and be added to our metadata dataframe.

meta <- seurat_merged@meta.data
View(meta)

We will be using a variety of visualization methods, including looking at the values on the spatial slide as well as the distribution of values before and after filtration.

For “nicer” (subjective) plotting with SpatialFeaturePlot() and SpatialDimPlot(), we will add some extra parameters to gain a clearer image beyond the default plot (shown here):

SpatialFeaturePlot(seurat_merged, 
                   "nCount_Spatial.008um")
Figure 1: Default SpatialFeaturePlot() visualization parameters

Therefore for the rest of this lesson, we will be using the following arguments:

  • pt.size.factor = 3: to clearly see each bin on the slide
  • image.alpha = 0: to remove the H&E stained image in the background of the image
  • max.cutoff and min.cutoff: to not allow the color scale to be driven by smaller populations of cells with high/low values

Number of Cells

ggplot(meta) +
  geom_bar(aes(x = orig.ident, fill = orig.ident),
           color = "black") +
  geom_text(aes(x = orig.ident, label=after_stat(count)), 
            stat='count', vjust=-1) +
  theme_classic()
Figure 2: Number of cells in the dataset, split by sample

UMI Counts (Transcripts) per Cell

TODO

  • talk about celltypes and how we see clear structure of certain cells with higher nUMIs
  • Cut-off of 10 for both samples

We expect to see a bimodal distribution, with one peak representing bins containing lower-quality cells with fewer UMIs and another peak representing bins containing healthy cells with more UMIs. Ideally, the peak representing lower-quality and dying cells is small and the peak representing healthy cells is large.

This is the number of unique transcripts detected per bin. Because the bins are very small, this number is less than what we would expect for non-spatial scRNA-seq data.

SpatialFeaturePlot(seurat_merged, 
                   "nCount_Spatial.008um",
                   pt.size.factor = 3,
                   image.alpha = 0,
                   max.cutoff = 2000)
Figure 3: Number of UMIs overlaid over spatial slide
ggplot(meta) +
  geom_density(aes(x = nCount_Spatial.008um, fill = orig.ident),
               alpha = 0.4,
               color = "black") +
  geom_vline(xintercept = 10, color = "pink") +
  geom_vline(xintercept = 10, color = "lightblue") +
  scale_x_log10() +
  theme_classic()
Figure 4: Number of UMIs density
meta_filt <- subset(meta,
  ((orig.ident == "P5CRC") & (nCount_Spatial.008um > 10)) |
  ((orig.ident == "P5NAT") & (nCount_Spatial.008um > 10)))

ggplot(meta_filt) +
  geom_density(aes(x = nCount_Spatial.008um,
                     fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 10, color = "pink") +
  geom_vline(xintercept = 10, color = "lightblue") +
  scale_x_log10() +
  theme_classic()
Figure 5: Number of UMIs density after filtration

A commonly asked question is if you have to use the same threshold values across all your samples. We recommend that you follow what the data is telling you and do it on a per sample basis. Ultimately the end goal is to retain high-quality bins, even if that means using different values for each sample.

Genes Detected per Cell

We have similar expectations for gene detection as for UMI detection, although it may be a bit lower than UMIs.

This is the number of unique genes detected per bin. Again, because the bins are very small, this number is less than what we would expect for non-spatial scRNA-seq data.

SpatialFeaturePlot(seurat_merged, 
                   "nFeature_Spatial.008um",
                   pt.size.factor = 3,
                   image.alpha = 0,
                   max.cutoff = 1200)
Figure 6: Number of features overlaid over spatial slide
ggplot(meta) +
  geom_density(aes(x = nFeature_Spatial.008um,
                     fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 15, color = "pink") +
  geom_vline(xintercept = 10, color = "lightblue") +
  scale_x_log10() +
  theme_classic()
Figure 7: Number of features density
meta_filt <- subset(meta,
  ((orig.ident == "P5CRC") & (nFeature_Spatial.008um > 15)) |
  ((orig.ident == "P5NAT") & (nFeature_Spatial.008um > 10)))

ggplot(meta_filt) +
  geom_density(aes(x = nFeature_Spatial.008um,
                     fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 15, color = "pink") +
  geom_vline(xintercept = 10, color = "lightblue") +
  scale_x_log10() +
  theme_classic()
Figure 8: Number of features density after filtration

Complexity (Novelty) Score

If there are many captured transcripts (high nUMI) and a low number of genes detected in a bin, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again. These low complexity (low novelty) bins could represent a specific cell type (i.e. red blood cells, which lack a typical transcriptome), or could be due to an artifact or contamination. Generally, we expect the complexity score to be above 0.80 for good-quality bins.

The novelty score is computed as shown below:

\[ \text{Complexity Score} = \frac{\text{Number of Genes}}{\text{Number of UMIs}} \]

Which we can now calculate using R and store in our @meta.data:

# Add number of genes per UMI for each cell to metadata
seurat_merged$log10GenesPerUMI <- log10(seurat_merged$nFeature_Spatial.008um) / 
                                  log10(seurat_merged$nCount_Spatial.008um)
SpatialFeaturePlot(seurat_merged, 
                   "log10GenesPerUMI",
                   pt.size.factor = 3,
                   image.alpha = 0,
                   min.cutoff = 0.7)
Figure 9: Complexity score overlaid over spatial slide
meta <- seurat_merged@meta.data
ggplot(meta) +
  geom_density(aes(x = log10GenesPerUMI,
                   fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 0.80) +
  theme_classic()
Figure 10: Complexity score density
meta_filt <- subset(meta, log10GenesPerUMI > 0.80)
ggplot(meta_filt) +
  geom_density(aes(x = log10GenesPerUMI,
                   fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 0.80) +
  theme_classic()
Figure 11: Complexity score density after filtration

Mitochondrial Counts Ratio

This metric can identify whether there is a large amount of mitochondrial contamination from dead or dying cells. We define poor-quality samples for mitochondrial counts as bins which surpass the 0.2 mitochondrial ratio threshold, unless of course you are expecting this in your sample. This ratio is computed as:

\[ \text{Mitochondrial Ratio} = \frac{\text{Number of reads aligning to mitochondrial genes}} {\text{Total reads}} \]

While using a baseline score of 0.2 is an acceptable threshold for removing high mitochondrial content cells, it is important to always go back to your original biological question. What samples are you working with? Do you expect there to be high values of mitochondria due to your experimental condition?

For example, if you were studying renal oncocytomas would you make this same choice? This disease is characterized as having aberrantly high mitochondrial expression, would it make sense to remove cells with high mitochondrial ratio?

# Compute percent mito ratio
seurat_merged$mitoRatio <- PercentageFeatureSet(object = seurat_merged, 
                                                pattern = "^MT-")
seurat_merged$mitoRatio <- seurat_merged@meta.data$mitoRatio / 100
SpatialFeaturePlot(seurat_merged, 
                   "mitoRatio",
                   pt.size.factor = 3,
                   image.alpha = 0,
                   max.cutoff = 0.5)
Figure 12: Mitochondrial ratio overlaid over spatial slide
meta <- seurat_merged@meta.data
ggplot(meta) +
  geom_density(aes(x = mitoRatio,
                   fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 0.25) +
  theme_classic()
Figure 13: Mitochondrial ratio density
meta_filt <- subset(meta, mitoRatio < 0.25)

ggplot(meta_filt) +
  geom_density(aes(x = mitoRatio,
                   fill = orig.ident),
                 alpha = 0.4,
                 color = "black") +
  geom_vline(xintercept = 0.25) +
  theme_classic()
Figure 14: Mitochondrial ratio density after filtration
  1. Do you notice a pattern in cells in regards to the number of UMIs and feature? Make a geom_point plots to compare these values on a per-cell basis and color each point by the mitochondrial ratio following the structure provided here:
# Structure for making geom_point plot
# Fill in values to answer the question
seurat_merged@meta.data %>%
  # Sorting by mitoRatio to make high scores appear on top of the plot
  arrange(mitoRatio) %>%
  ggplot() +
  geom_point(aes(x = ?, 
                 y = ?,
                 color = ?),
             size = 0.5) +
  # Setting limits so that outliers don't determine scale of the plot
  ylim(0, 3500) + xlim(0, 3500) +
  theme_bw()

Filter

We will apply very minimal filtering here. It has been shown that low expression can be biologically meaningful for spatial context so we won’t be as stringent as we normally are with scRNA-seq.

# TODO there has to be a cleaner way to do this
seurat_filtered <- subset(seurat_merged,
  ((orig.ident == "P5CRC") & (nCount_Spatial.008um > 10)) |
  ((orig.ident == "P5NAT") & (nCount_Spatial.008um > 10)))

seurat_filtered <- subset(seurat_filtered,
  ((orig.ident == "P5CRC") & (nFeature_Spatial.008um > 15)) |
  ((orig.ident == "P5NAT") & (nFeature_Spatial.008um > 10)))
  
seurat_filtered <- subset(seurat_filtered, mitoRatio < 0.25)
seurat_filtered <- subset(seurat_filtered, log10GenesPerUMI > 0.80)

seurat_filtered
An object of class Seurat 
18085 features across 826613 samples within 1 assay 
Active assay: Spatial.008um (18085 features, 0 variable features)
 1 layer present: counts
 2 spatial fields of view present: P5CRC.008um P5NAT.008um

How many cells did we remove in this filtration process?

ncol(seurat_merged) - ncol(seurat_filtered)
[1] 151128
  1. How many bins do we have per sample after this filtration step?

Visualizing Counts Data

We can visualize the number of UMIs and gene counts per bin, both as a distribution and layered on top of the tissue image.

Violin Plots

Let’s start with a violin plot to look at the distribution of UMI counts and gene counts. The input is our post-filtered dataset.

p_ncount <- VlnPlot(seurat_filtered, 
                    features = "nCount_Spatial.008um", 
                    pt.size = 0, group.by = 'orig.ident') +
  NoLegend()

p_nfeats <- VlnPlot(seurat_filtered, 
                    features = "nFeature_Spatial.008um", 
                    pt.size = 0, group.by = 'orig.ident') + 
  NoLegend()

p_ncount | p_nfeats
Figure 15: Violin plot of nCounts and nFeatures after filtration

We see that both distributions have a similar peak but the nUMI distribution has a much longer tail. This is expected, because while the small physical size of the bins means that most genes will be detected only once or twice, a minority of bins under very transcriptionally active cells may exhibit multiple transcripts of the same gene.

Spatial Overlay

Next, we can look at the same metrics and the distribution on the actual image itself. Note that many spots have very few counts, in part due to low cellular density or cell types with low complexity in certain tissue regions.

SpatialFeaturePlot(seurat_filtered, 
                   c("nFeature_Spatial.008um", "nCount_Spatial.008um"),
                   pt.size.factor = 3,
                   image.alpha = 0)
Figure 16: Number of features and counts overlaid over spatial slide

Save!

Now is a great spot to save our seurat_filtered object as we have finished filtered.

# Save Seurat object
saveRDS(seurat_filtered, "data/seurat_filtered.RDS")